Snowflake Bulk Ingest with Storage Integration
In big data processing, you sometimes need to load huge amounts of data in batches, and bulk ingestion is designed for exactly this scenario. Snowflake Bulk Ingest, used in the data integration stage, helps you load batches of data from files available in a data lake such as Amazon S3. Bulk Ingest loads a chunk of data every time you run the pipeline: the data is first pushed into a landing layer and then sent to the unification layer.
Calibo's Data Pipeline Studio (DPS) supports bulk ingestion using Snowflake Bulk Ingest in the data integration stage, with Amazon S3 as the data lake in the data source stage and Snowflake as the target data lake. Following is an example of a Snowflake bulk ingest data pipeline:
How Snowflake Bulk Ingest works
With Snowflake Bulk Ingest, every time you run the data pipeline, data from S3 is ingested into the data lake. During data ingestion, the data is first pushed into a landing layer. Depending on the use case, you can perform operations such as append, overwrite, or merge on the data. The processed data is then pushed into the unification layer. This process requires credentials to access the Amazon S3 bucket, as well as read and write permissions on the Snowflake objects. To avoid sharing credentials directly, you can use Storage Integration.
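For illustration only, this flow maps to ordinary Snowflake SQL: a batch is copied from the S3 stage into the landing table and then applied to the unification table. The minimal Python sketch below uses the snowflake-connector-python package with hypothetical connection details and object names (CUSTOMERS_LANDING, CUSTOMERS, CUSTOMERS_LANDING_STAGE); it is not the exact SQL that DPS generates.

```python
import snowflake.connector  # pip install snowflake-connector-python

# Hypothetical connection details and object names -- not the SQL DPS generates.
conn = snowflake.connector.connect(
    account="myorg-myaccount", user="MY_USER", password="***",
    warehouse="MY_WH", database="MY_DB", schema="LANDING",
)
cur = conn.cursor()

# 1. Land the batch: copy files from the external S3 stage into the landing table.
cur.execute("""
    COPY INTO CUSTOMERS_LANDING
    FROM @CUSTOMERS_LANDING_STAGE
    FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
""")

# 2. Unify: apply the landed rows to the unification-layer table
#    (append shown here; overwrite and merge are the other operation types).
cur.execute("""
    INSERT INTO MY_DB.UNIFICATION.CUSTOMERS (ID, NAME, EMAIL)
    SELECT ID, NAME, EMAIL FROM CUSTOMERS_LANDING
""")

cur.close()
conn.close()
```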
What is Storage Integration?
Storage Integration is a Snowflake object that lets you connect from Snowflake to your AWS account using the IAM service. You can specify allowed and blocked storage locations, which provides enhanced security for the complete data ingestion operation.
See Configuring a Snowflake storage integration to access Amazon S3.
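For reference, a storage integration is created in Snowflake with the CREATE STORAGE INTEGRATION command, typically by a user with the ACCOUNTADMIN role. The sketch below uses hypothetical names for the integration, the IAM role ARN, and the buckets; the linked guide covers the full procedure, including the AWS trust-policy setup.

```python
import snowflake.connector  # pip install snowflake-connector-python

# Hypothetical account, role ARN, and bucket paths.
conn = snowflake.connector.connect(
    account="myorg-myaccount", user="MY_ADMIN", password="***", role="ACCOUNTADMIN",
)
cur = conn.cursor()

# Snowflake assumes the given IAM role; stages built on this integration may
# only reference the allowed locations (blocked locations are optional).
cur.execute("""
    CREATE STORAGE INTEGRATION S3_LAKE_INT
      TYPE = EXTERNAL_STAGE
      STORAGE_PROVIDER = 'S3'
      ENABLED = TRUE
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-s3-access'
      STORAGE_ALLOWED_LOCATIONS = ('s3://my-data-lake/landing/')
      STORAGE_BLOCKED_LOCATIONS = ('s3://my-data-lake/restricted/')
""")

cur.close()
conn.close()
```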
Prerequisites for using Snowflake Bulk Ingest in the data integration layer
Ensure that you meet the following prerequisites:
- You must have Amazon S3 and Snowflake data lakes configured in the Lazsa Platform.
- You must have a storage integration created in Snowflake.
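To check the second prerequisite, you can describe the integration from Snowflake. DESC INTEGRATION lists its properties, including the IAM user ARN and external ID that the AWS role behind the integration must trust; the integration name below is hypothetical.

```python
import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account="myorg-myaccount", user="MY_USER", password="***",
)
cur = conn.cursor()

# Each row is (property, property_type, property_value, property_default).
# Look for STORAGE_AWS_IAM_USER_ARN, STORAGE_AWS_EXTERNAL_ID, and the
# allowed/blocked locations of the (hypothetical) integration.
cur.execute("DESC INTEGRATION S3_LAKE_INT")
for prop, _ptype, value, _default in cur.fetchall():
    print(f"{prop} = {value}")

cur.close()
conn.close()
```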
To create a data integration job for Snowflake bulk ingest
- On the home page of DPS, add the following stages. Your pipeline looks like this:
  - Data Lake: Amazon S3
  - Data Integration: Snowflake Bulk Ingest
  - Data Lake: Snowflake
- Configure the Amazon S3 and Snowflake nodes.
- Click the data integration node and click Create Job.
- To create the data integration job, provide the following inputs:
  - Job Name: Provide a name for the data integration job.
  - Node Rerun Attempts: The number of times the pipeline run is attempted on this node in case of failure. By default, the setting defined at the pipeline level is used. To change the default setting, select an option from the dropdown.
  - Datastore: This is populated based on the datastore that you configure for the data lake (source) node.
  - Choose Source Format: Select one of the following options (representative Snowflake file formats are sketched after these steps):
    - CSV
    - JSON
    - Parquet
  - Add Base Path:
    - Click Add Base Path.
    - In the Choose Base Path screen, select a path and then click Select.
- Click Next.
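The source format chosen above determines how the landed files are parsed, which in Snowflake terms corresponds to a file format. The definitions below are only representative examples with hypothetical names and options; DPS derives the actual file format for you (see File Format on the next screen).

```python
import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account="myorg-myaccount", user="MY_USER", password="***",
    warehouse="MY_WH", database="MY_DB", schema="LANDING",
)
cur = conn.cursor()

# Representative file formats for the three supported source formats.
cur.execute("""
    CREATE OR REPLACE FILE FORMAT CSV_FMT
      TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1
""")
cur.execute("""
    CREATE OR REPLACE FILE FORMAT JSON_FMT
      TYPE = 'JSON' STRIP_OUTER_ARRAY = TRUE
""")
cur.execute("CREATE OR REPLACE FILE FORMAT PARQUET_FMT TYPE = 'PARQUET'")

cur.close()
conn.close()
```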
- On the Landing Layer Details screen, provide the following inputs:
  - Database – This is populated based on the selected database.
  - Landing Layer Schema – This is populated based on the selected schema.
  - Create/Choose Landing Layer Table – Either select a table from the dropdown list or create a new table in the landing layer, where the data is stored temporarily.
  - File Format – This is populated based on the file format that you selected for the source stage.
  - Stage Name for S3 – This is derived from the landing layer table that you create or choose; the suffix Stage is added to it to form the stage name for S3.
  - Stream on Landing Table – This is derived from the landing layer table that you create or choose; the suffix Stream is added to it to form the stream name for the landing table. (The corresponding stage and stream objects are sketched after this step.)
- Click Next.
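The Stage Name for S3 and Stream on Landing Table fields correspond to an external stage over the S3 base path and a stream on the landing table, so each run can pick up only newly landed rows. A rough SQL equivalent is sketched below with illustrative names (the suffix convention shown here is only approximate) and an assumed storage integration S3_LAKE_INT.

```python
import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account="myorg-myaccount", user="MY_USER", password="***",
    warehouse="MY_WH", database="MY_DB", schema="LANDING",
)
cur = conn.cursor()

# External stage over the S3 base path, authenticated through the storage
# integration instead of embedded credentials (all names are illustrative).
cur.execute("""
    CREATE OR REPLACE STAGE CUSTOMERS_LANDING_STAGE
      URL = 's3://my-data-lake/landing/customers/'
      STORAGE_INTEGRATION = S3_LAKE_INT
      FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
""")

# Stream on the landing table, so later steps can read only the rows
# added since the previous pipeline run.
cur.execute("CREATE OR REPLACE STREAM CUSTOMERS_LANDING_STREAM ON TABLE CUSTOMERS_LANDING")

cur.close()
conn.close()
```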
The Datastore and Warehouse are populated based on the options selected.
Provide the following information for Unification Layer Details:
- Database – This is populated based on the selected database.
- Target Schema – This is populated based on the selected schema.
- Create/Choose Unification Layer Table – Either select a table from the dropdown list or create a new table in the unification layer where the processed data is stored.
- Operation Type – Select the type of operation to be performed on the source data during the job run. Choose one of the following options (a sketch of the corresponding SQL follows this list):
  - Append – Adds new data at the end of the table without erasing the existing content.
  - Overwrite – Replaces the existing data in the table.
  - Merge – Combines the existing and new data in the table.
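Each operation type corresponds roughly to a different SQL pattern for moving rows from the landing layer into the unification table. The sketch below shows one plausible form of each, using hypothetical tables keyed on a single ID column; only one of these would run in a given job, and the SQL that DPS actually generates may differ.

```python
import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account="myorg-myaccount", user="MY_USER", password="***",
    warehouse="MY_WH", database="MY_DB", schema="UNIFICATION",
)
cur = conn.cursor()

# Append: add the landed rows without touching existing data.
cur.execute("""
    INSERT INTO CUSTOMERS (ID, NAME, EMAIL)
    SELECT ID, NAME, EMAIL FROM MY_DB.LANDING.CUSTOMERS_LANDING
""")

# Overwrite: replace the table contents with the landed rows.
cur.execute("""
    INSERT OVERWRITE INTO CUSTOMERS (ID, NAME, EMAIL)
    SELECT ID, NAME, EMAIL FROM MY_DB.LANDING.CUSTOMERS_LANDING
""")

# Merge: update matching rows and insert new ones, keyed on ID here.
cur.execute("""
    MERGE INTO CUSTOMERS t
    USING MY_DB.LANDING.CUSTOMERS_LANDING s ON t.ID = s.ID
    WHEN MATCHED THEN UPDATE SET t.NAME = s.NAME, t.EMAIL = s.EMAIL
    WHEN NOT MATCHED THEN INSERT (ID, NAME, EMAIL) VALUES (s.ID, s.NAME, s.EMAIL)
""")

cur.close()
conn.close()
```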
- This screen lets you map the columns and datatypes of the selected source table to the columns and datatypes of the target table.
- To add custom columns to the target table, provide the following information under Add Custom Columns (a sketch of how such columns might be populated follows this section):
  - Column Name – Provide a name for the custom column.
  - Type – Select Static Parameter or System Parameter. If you choose System Parameter, select a parameter value from the dropdown.
  - Value – Provide a value or select one from the dropdown, depending on the type you selected.
- Click Add.
Under Added Custom Columns, you can perform the following actions:
- Click the pencil icon to edit the details of a custom column. Make the required changes and click Update.
- Click the delete icon to delete a custom column that you have added.
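As an example of custom columns, a static parameter might tag every row with a fixed source name, while a system parameter might capture the load timestamp. The sketch below shows one way such columns could be populated during an append; the column names, the values, and the exact behaviour of DPS's static and system parameters are assumptions.

```python
import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account="myorg-myaccount", user="MY_USER", password="***",
    warehouse="MY_WH", database="MY_DB", schema="UNIFICATION",
)
cur = conn.cursor()

# Hypothetical custom columns on the target table:
#   SOURCE_SYSTEM -- a static value supplied in the job definition.
#   LOADED_AT     -- a system value captured at load time.
cur.execute("""
    INSERT INTO CUSTOMERS (ID, NAME, EMAIL, SOURCE_SYSTEM, LOADED_AT)
    SELECT ID, NAME, EMAIL, 'S3_BULK_INGEST', CURRENT_TIMESTAMP()
    FROM MY_DB.LANDING.CUSTOMERS_LANDING
""")

cur.close()
conn.close()
```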
Select a storage integration from the dropdown list.
A storage integration is an object created in Snowflake that stores a generated identity and access management (IAM) user for your S3 cloud storage, along with an optional set of allowed or blocked storage locations (that is, buckets).
Ensure that the storage integration that you select has access to the S3 bucket selected in the source stage.
For information on how to create a storage integration, see Configuring a Snowflake storage integration to access Amazon S3.
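One way to confirm that the selected storage integration can reach the source bucket is to grant its usage to the role the job runs under and list the files through a stage built on it. The role, integration, and stage names below are hypothetical.

```python
import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account="myorg-myaccount", user="MY_ADMIN", password="***", role="ACCOUNTADMIN",
)
cur = conn.cursor()

# Allow the job's role to create and use stages that reference the integration.
cur.execute("GRANT USAGE ON INTEGRATION S3_LAKE_INT TO ROLE DPS_JOB_ROLE")

# Listing the stage succeeds only if the integration's IAM role can read the
# bucket and the path falls within STORAGE_ALLOWED_LOCATIONS.
cur.execute("LIST @MY_DB.LANDING.CUSTOMERS_LANDING_STAGE")
for name, size, _md5, _last_modified in cur.fetchall():
    print(name, size)

cur.close()
conn.close()
```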
What's next? Snowflake Stream Ingest with Storage Integration